The impact of near domain transfer on biomedical named entity recognition

Authors

  • Nigel Collier
  • Mai-Vu Tran
  • Ferdinand Paster
Abstract

              C1             C2             a       b
Abstracts     110            80
Tokens        27,421         26,578
Av. length    32.57          29.93
ANA           194 (138)      195 (133)      0.33    0.26
CHE           44 (33)        147 (75)       0.08    0.07
DIS           892 (282)      955 (442)      0.39    0.27
GGP           1663 (928)     754 (511)      0.41    0.45
ORG           799 (429)      770 (323)      0.56    0.67
PHE           507 (423)      1430 (1113)    0.52    0.33

Table 2: Characteristics of the C1 auto-immune and C2 cardiovascular corpora: number of abstracts, number of tokens, average sentence length, frequency of each entity type. Figures in parentheses represent counts after removing duplication. a: probability that a word in an entity class X in C1 is also a word in entity class X in C2. b: probability that a word in an entity class X in C2 is also a word in entity class X in C1.

(3) We calculated from Table 2 the average number of mentions for each entity form by class and noted that this is relatively stable across corpora, except for DIS, which has less variation in C2 than C1, and CHE, which has more variation in C2 than C1. When combining evidence from both corpora, the approximate order of type/token ratios is PHE < ANA < CHE,GGP < ORG < DIS, indicating that on average PHE entities have the greatest variation. Average entity lengths in tokens (not shown) indicate that PHE are significantly longer than other entity mentions; and

(4) We calculated the probability that a word token in an entity class from one corpus would appear in an instance of the same entity class in the other corpus, reported as columns a and b. Although the probability of an exact match in instances between entities in the two corpora is generally quite low (below 20%; data not shown), there appears to be significant vocabulary overlap in most classes except for chemicals.

3.2 Conditional Random Fields

As in (Finkel and Manning, 2009) we apply our approach to a linear chain conditional random field (CRF) model (Lafferty et al., 2001; McCallum and Wei, 2003; Settles, 2004; Doan et al., 2012) using the Mallet toolkit with default parameters. CRFs have been shown consistently to be among the highest performing bioNER learners. The data selection strategies employed here, though, are neutral and could have been applied to any other fully supervised learner model.
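
The corpus statistics discussed for Table 2 can be re-derived with a short script. The sketch below is illustrative only, not the authors' code. It computes the average number of mentions per unique entity form for each class from the Table 2 counts, and it shows one plausible way to compute the word-overlap probabilities reported as columns a and b. Several things are assumptions: the function names, lowercasing and whitespace tokenization, pooling the two corpora by simply adding their counts (which double-counts forms shared by both corpora), and the toy mention lists, which are invented for the example. Under these assumptions, sorting classes by mentions per form appears to give the same ordering as the one quoted in the text.

```python
# Counts taken from Table 2: class -> ((C1 total, C1 unique), (C2 total, C2 unique)).
TABLE2 = {
    "ANA": ((194, 138), (195, 133)),
    "CHE": ((44, 33), (147, 75)),
    "DIS": ((892, 282), (955, 442)),
    "GGP": ((1663, 928), (754, 511)),
    "ORG": ((799, 429), (770, 323)),
    "PHE": ((507, 423), (1430, 1113)),
}


def mentions_per_form(total, unique):
    """Average number of mentions per unique entity form."""
    return total / unique


def combined_mentions_per_form(table):
    """Pool both corpora by adding counts (an approximation: forms that
    occur in both corpora are counted twice in the unique total)."""
    return {
        cls: mentions_per_form(t1 + t2, u1 + u2)
        for cls, ((t1, u1), (t2, u2)) in table.items()
    }


def word_overlap(source_mentions, target_mentions):
    """Fraction of word tokens in the source mentions that also occur in
    at least one target mention (assumed reading of columns a and b)."""
    target_vocab = {w.lower() for m in target_mentions for w in m.split()}
    source_tokens = [w.lower() for m in source_mentions for w in m.split()]
    if not source_tokens:
        return 0.0
    hits = sum(1 for w in source_tokens if w in target_vocab)
    return hits / len(source_tokens)


RATIOS = combined_mentions_per_form(TABLE2)
# Classes sorted from fewest to most mentions per unique form.
ORDER = sorted(RATIOS, key=RATIOS.get)

# Hypothetical toy mentions, NOT taken from the C1 or C2 corpora.
toy_c1_dis = ["rheumatoid arthritis", "type 1 diabetes"]
toy_c2_dis = ["heart disease", "diabetes"]

if __name__ == "__main__":
    for cls in ORDER:
        print(f"{cls}: {RATIOS[cls]:.2f} mentions per form")
    print("a (toy):", word_overlap(toy_c1_dis, toy_c2_dis))
    print("b (toy):", word_overlap(toy_c2_dis, toy_c1_dis))
```

A low number of mentions per form means that most mentions in a class are distinct strings, which matches the reading that PHE shows the greatest variation. The overlap function is deliberately asymmetric, so swapping its arguments yields the two directions that columns a and b describe.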

Similar resources

A neural network-based system for recognizing and classifying named entities in Persian-language texts

Named Entity Recognition (NER) is a fundamental task in natural language processing and also known as a subset of information extraction. We seek to locate and classify named entities in text into predefined categories such as the names of persons, organizations, locations, expressions of times, etc. Named Entity Recognition for English texts has been researched widely for the past years, howev...

A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and a hybrid of the two. Machine-learning-based methods have good performance in the Persian language if they are trained with good features. To get good performanc...

Named Entity Recognition in Persian Text using Deep Learning

Named entities recognition is a fundamental task in the field of natural language processing. It is also known as a subset of information extraction. The process of recognizing named entities aims at finding proper nouns in the text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...

Improving Persian named entity recognition using the ezafe (kasre-ye ezafe) marker

Named entity recognition is a process in which the people’s names, name of places (cities, countries, seas, etc.) and organizations (public and private companies, international institutions, etc.), date, currency and percentages in a text are identified. Named entity recognition plays an important role in many NLP tasks such as semantic role labeling, question answering, summarization, machine ...

Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination

Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...

Enhancing HMM-based biomedical named entity recognition by studying special phenomena

The purpose of this research is to enhance an HMM-based named entity recognizer in the biomedical domain. First, we analyze the characteristics of biomedical named entities. Then, we propose a rich set of features, including orthographic, morphological, part-of-speech, and semantic trigger features. All these features are integrated via a Hidden Markov Model with back-off modeling. Furthermore,...

Publication date: 2014